Finite State Models for the Generation of Large Corpora of Natural Language Texts
Abstract
Natural language is probably one of the most common types of input for text processing algorithms. It is therefore often desirable to have a large training and testing set of such input, especially for algorithms tuned to natural language texts. The difficulty in building good corpora is that natural language texts are often too short, relative to the size required, to test text processing algorithms such as string matching and compression algorithms effectively. This is the case, for instance, with the well-known Canterbury Corpus [RB97], used for testing lossless data compression algorithms, whose natural language texts are relatively small, none larger than 500Kb. The only exception is the “King James Version of the Bible” (approximately 3.85Mb) contained in the Large Corpus [RB97]. By contrast, corpora of non-textual data contain test files of up to 3Mb (such as the Protein Corpus [NMW99] and the Silesia Corpus [D03]), while tests on random texts are often performed on buffers of 10Mb [ACR99] and 20Mb [CF04]. In many cases the lack of a large corpus of natural language texts can be overcome by simply concatenating a set of collected texts, even ones from heterogeneous contexts and by different authors. This is done, for example, by the Linguistic Data Consortium (http://www.ldc.upenn.edu), an open consortium of universities that creates, collects and distributes speech and text databases and other resources for research and development purposes. Even so, the ability to automatically generate texts that preserve the properties of real texts remains appealing.
In this note we present a preliminary study of a finite state model for text generation that preserves statistical and structural characteristics of natural language texts, namely Zipf’s law [Z32] and the inverse-rank power law [CF03], thus providing a very good approximation for testing purposes.
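As an illustration of the kind of statistical check mentioned above, the following minimal sketch tests whether a text roughly follows Zipf's law, which in its simplest form says that the frequency of the word of rank r is approximately proportional to 1/r. On a log-log plot of frequency against rank this corresponds to a slope close to -1. The function names `rank_frequency` and `zipf_slope` are hypothetical and the whitespace tokenization is an assumption; this is not the procedure used in the note itself.

```python
import math
from collections import Counter


def rank_frequency(text):
    """Return word counts sorted from most to least frequent.

    Tokenization is a plain lowercase whitespace split (an assumption
    made for this sketch; real corpora need a proper tokenizer).
    """
    counts = Counter(text.lower().split())
    return sorted(counts.values(), reverse=True)


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    A value near -1 is consistent with the simplest form of Zipf's law.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Applied to both a real text and a generated one, `zipf_slope(rank_frequency(text))` gives a single number that can be compared between the two; a fuller comparison would also look at how well the points fit the line.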
Similar resources
Syntactic Complexity of Russian Unified State Exam Texts in English: A Study on Reliability and Validity
In this study we analyze texts used in the Russian Unified State Exam in English. The texts forming two small research corpora were retrieved from two sources: the official USE database, as a reference point, and “Neznaika” (https://neznaika.pro/), a popular website used by pupils for USE training. The two corpora are balanced in size: USE has 11934 tokens and “Neznaika” has 11918 tokens. We share Biber’...
Adding Syntax to Dynamic Programming for Aligning Comparable Texts for the Generation of Paraphrases
Multiple sequence alignment techniques have recently gained popularity in the Natural Language community, especially for tasks such as machine translation, text generation, and paraphrase identification. Prior work falls into two categories, depending on the type of input used: (a) parallel corpora (e.g., multiple translations of the same text) or (b) comparable texts (non-parallel but on the s...
Arabic Entity Graph Extraction Using Morphology, Finite State Machines, and Graph Transformations
Research on automatic recognition of named entities from Arabic text uses techniques that work well for the Latin based languages such as local grammars, statistical learning models, pattern matching, and rule-based techniques. These techniques boost their results by using application specific corpora, parallel language corpora, and morphological stemming analysis. We propose a method for extra...
Training Neural Network Language Models on Very Large Corpora
Published in Joint Conference HLT/EMNLP, pages 201–208, Oct. 2005. During the last years there has been growing interest in using neural networks for language modeling. In contrast to the well-known back-off n-gram language models, the neural network approach attempts to overcome the data sparseness problem by performing the estimation in a continuous space. This type of language model was mostly...
Vocabulary Lists for EAP and Conversation Students
Despite the abundance of research investigating general and academic vocabularies and developing dozens of word lists, few studies have compared academic vocabulary with general service word lists such as conversation vocabulary. Many EAP researchers assume that university students need to know all the words in West’s (1953) General Service List (GSL) as a prerequisite to academic words (e.g., ...
Publication date: 2008